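Overview of Parallel Programming#
Course: Numerical Analysis Project
Overview#
Semiconductor performance has improved at roughly the pace of Moore's law
Single-core performance gains have slowed
Cluster-type supercomputers that link many computers together
Parallel computing on multi-core and many-core processors
AI training on GPUs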
Computer Architecture#
Von Neumann architecture
Consists of a CPU, memory, storage, network, and so on
Fig. 14 Von Neumann architecture (From Wikipedia)#
CPU : consists of an ALU, a CU, and cache
Multi-core processor
Fig. 15 Dual Core Processor (From Wikipedia)#
SIMD (Single instruction, multiple data) : vector computation (MMX, SSE, AVX, NEON)
Fig. 16 SIMD (From Wikipedia)#
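The SIMD idea of one operation applied to many data elements can be sketched with NumPy. This is only an illustration: the function names `axpy_scalar` and `axpy_vector` are hypothetical, and whether the whole-array form actually runs on SSE/AVX/NEON instructions depends on the NumPy build and the CPU.

```python
import numpy as np


def axpy_scalar(a, x, y):
    """Compute a*x + y one element per loop iteration (scalar style)."""
    out = np.empty_like(x)
    for i in range(x.size):
        out[i] = a * x[i] + y[i]
    return out


def axpy_vector(a, x, y):
    """Compute a*x + y as a whole-array expression.

    NumPy evaluates this in compiled loops, which can process several
    elements per instruction when SIMD support is available.
    """
    return a * x + y


if __name__ == "__main__":
    x = np.arange(8, dtype=np.float64)
    y = np.ones(8)
    # Both forms give the same values; only the execution style differs.
    print(np.allclose(axpy_scalar(2.0, x, y), axpy_vector(2.0, x, y)))
```

Both functions return the same array; the second form simply hands the loop to code that can be vectorized.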
Memory : memory speed has improved more slowly than processor speed
Types : DDR, GDDR, HBM
Network
Routers, cables, network cards
Types : Ethernet (1G, 10G), Omni-Path, InfiniBand
Parallel Programming Models#
Message Passing Model#
Each process has its own independent memory and computes in parallel while exchanging data through communication
Libraries : MPI (MPICH, Open MPI, MS-MPI, Intel MPI)

Fig. 18 Message Passing model (From KSC)#
Watch out for deadlock
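The message passing idea and the deadlock risk can be sketched without an MPI installation. The example below is a stand-in that uses Python's standard `multiprocessing` module rather than MPI: two processes with private memory swap one value over a pipe. The names `worker` and `exchange` are hypothetical, and the `fork` start method is an assumption that holds on POSIX systems only.

```python
import multiprocessing as mp


def worker(rank, conn, results):
    """Run one 'rank'; its variables are private to this process."""
    local = (rank + 1) * 10          # rank 0 holds 10, rank 1 holds 20
    if rank == 0:
        conn.send(local)             # rank 0 sends first, then receives
        remote = conn.recv()
    else:
        remote = conn.recv()         # rank 1 receives first, then sends
        conn.send(local)
    # If both ranks called recv() first, each would block waiting for the
    # other and neither send() would ever run: a deadlock.
    results.put((rank, local + remote))


def exchange():
    """Start two processes that swap one value each and return their sums."""
    ctx = mp.get_context("fork")     # assumption: POSIX system with fork
    end0, end1 = ctx.Pipe()          # two-way channel between the ranks
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(r, end, results))
             for r, end in enumerate((end0, end1))]
    for p in procs:
        p.start()
    sums = dict(results.get() for _ in procs)   # collect before joining
    for p in procs:
        p.join()
    return sums


if __name__ == "__main__":
    print(exchange())
```

Ordering the send and receive calls so that every receive has a matching send already on its way is the same discipline needed with blocking MPI calls.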
Parallel Computing Performance#
Amdahl’s law#
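If a fraction \(p\) of the code is parallelized and that part runs \(N\) times faster, the overall speedup is \(S = \dfrac{1}{(1 - p) + p/N}\).
Fig. 19 Comparison of computing performance (From Wikipedia)#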
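Amdahl's law is easy to evaluate numerically. The sketch below uses the standard form of the law; the function name `amdahl_speedup` and the value \(p = 0.95\) are illustrative assumptions, not figures from these notes.

```python
def amdahl_speedup(p, n):
    """Overall speedup S = 1 / ((1 - p) + p / n).

    p : fraction of the run time that is parallelized (0 <= p <= 1)
    n : speedup factor of the parallelized part (for example, thread count)
    """
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    p = 0.95  # hypothetical parallel fraction
    for n in (1, 2, 4, 8, 16, 32):
        print(f"N = {n:2d}: S = {amdahl_speedup(p, n):.2f}")
    # The serial part bounds the speedup: S approaches 1 / (1 - p) as N grows.
    print(f"upper bound: {1.0 / (1.0 - p):.1f}")
```

Even a small serial fraction limits the achievable speedup, which is why the curve flattens as \(N\) grows.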
Python 병렬 프로그래밍#
Numba#
Provides automatic loop parallelization via prange
Uses OpenMP, Intel TBB, and others, depending on the threading layer
mpi4py#
Python bindings for the MPI library
Example#
Parallelize the Laplace code using the fork-join model
import numba as nb
import numpy as np

# Use OpenMP as the Numba threading layer
from numba import config
config.THREADING_LAYER = 'omp'

# Intel MKL as the BLAS and LAPACK backend
import mkl
def solve_laplace(n, solver, tol=1e-5, order='C'):
    """
    Laplace equation solver

    Parameters
    ----------
    n : integer
        number of interior grid points per direction
    solver : function
        iterative solver, called as solver(n, ti, dt)
    tol : float
        convergence tolerance
    order : string
        memory layout, 'C' | 'F'

    Returns
    -------
    err : float
        scaled norm of the final update (used as the residual)
    """
    ti = np.zeros((n+2, n+2), order=order)
    dt = np.zeros((n+2, n+2), order=order)

    def bc(t):
        t[-1, 1:-1] = 300
        t[0, 1:-1] = 100
        t[1:-1, -1] = 100
        t[1:-1, 0] = 100

    err = 1
    while err > tol:
        # Apply BC
        bc(ti)

        # Run one sweep of the iterative solver
        solver(n, ti, dt)

        # Compute error
        err = np.linalg.norm(dt) / n

    return err
@nb.njit(fastmath=True)
def jacobi_nb(n, ti, dt):
    """
    Jacobi method (serial)

    Parameters
    ----------
    n : integer
        number of interior grid points per direction
    ti : array
        current solution, updated in place
    dt : array
        difference between the new and the current solution
    """
    for i in range(1, n+1):
        for j in range(1, n+1):
            dt[i, j] = 0.25*(ti[i-1, j] + ti[i, j-1] + ti[i+1, j] + ti[i, j+1]) - ti[i, j]

    # Update
    ti += dt
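@nb.njit(fastmath=True, parallel=True)
def jacobi_nbp(n, ti, dt):
    """
    Jacobi method (parallel, using prange)

    Parameters
    ----------
    n : integer
        number of interior grid points per direction
    ti : array
        current solution, updated in place
    dt : array
        difference between the new and the current solution
    """
    # Rows are independent in a Jacobi sweep, so the outer loop is split across threads
    for i in nb.prange(1, n+1):
        for j in range(1, n+1):
            dt[i, j] = 0.25*(ti[i-1, j] + ti[i, j-1] + ti[i+1, j] + ti[i, j+1]) - ti[i, j]

    # Update
    for i in nb.prange(n+2):
        for j in range(n+2):
            ti[i,j] += dt[i,j]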
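For checking the Numba kernels on a small grid, the same Jacobi sweep can be written with NumPy slicing. This is a reference sketch under the same `solver(n, ti, dt)` calling convention; the name `jacobi_np` is hypothetical and does not appear in the original notes.

```python
import numpy as np


def jacobi_np(n, ti, dt):
    """One Jacobi sweep on the (n+2) x (n+2) grid `ti`.

    The update for the interior points is stored in `dt` and then added
    to `ti` in place. `n` is unused here and kept only so the signature
    matches the other solvers.
    """
    dt[1:-1, 1:-1] = 0.25 * (ti[:-2, 1:-1] + ti[1:-1, :-2]
                             + ti[2:, 1:-1] + ti[1:-1, 2:]) - ti[1:-1, 1:-1]
    ti += dt


if __name__ == "__main__":
    n = 4
    ti = np.zeros((n + 2, n + 2))
    dt = np.zeros_like(ti)
    # Same boundary values as bc() in solve_laplace
    ti[-1, 1:-1] = 300
    ti[0, 1:-1] = 100
    ti[1:-1, -1] = 100
    ti[1:-1, 0] = 100
    jacobi_np(n, ti, dt)
    print(ti)
```

Because the right-hand side is evaluated completely before `dt` is written, every point uses only values from the previous sweep, which is what distinguishes Jacobi from Gauss-Seidel.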
n = 2048
%time solve_laplace(n, jacobi_nb, tol=5e-3)
CPU times: user 6min 3s, sys: 5.83 s, total: 6min 8s
Wall time: 23.4 s
0.004998731199856652
# Measured on an AMD Threadripper 5955WX (16C/32T)
for i in [1, 2, 4, 8, 16, 32]:
    # Adjust number of threads for numba and MKL
    nb.set_num_threads(i)
    mkl.set_num_threads(i)
    print("Number of Threads :", i)

    # Measure time
    %time solve_laplace(n, jacobi_nbp, tol=5e-3)
Number of Threads : 1
CPU times: user 27.3 s, sys: 79.4 ms, total: 27.4 s
Wall time: 24.6 s
Number of Threads : 2
CPU times: user 22.9 s, sys: 32 ms, total: 23 s
Wall time: 11.5 s
Number of Threads : 4
CPU times: user 23.8 s, sys: 60 ms, total: 23.9 s
Wall time: 5.97 s
Number of Threads : 8
CPU times: user 26.2 s, sys: 92.1 ms, total: 26.3 s
Wall time: 3.29 s
Number of Threads : 16
CPU times: user 33.3 s, sys: 192 ms, total: 33.5 s
Wall time: 2.09 s
Number of Threads : 32
CPU times: user 2min 20s, sys: 1.87 s, total: 2min 22s
Wall time: 4.48 s
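The wall times above can be turned into speedup and parallel efficiency figures. The sketch below copies the measured wall times from the output; the function name `scaling_table` is hypothetical.

```python
# Wall times in seconds, copied from the %time output above
wall = {1: 24.6, 2: 11.5, 4: 5.97, 8: 3.29, 16: 2.09, 32: 4.48}


def scaling_table(wall):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    base = wall[1]
    return {n: (base / t, base / t / n) for n, t in wall.items()}


if __name__ == "__main__":
    for n, (s, e) in scaling_table(wall).items():
        print(f"{n:2d} threads: speedup {s:5.2f}, efficiency {e:4.2f}")
```

With these numbers the speedup peaks at 16 threads, which matches the 16 physical cores noted in the code comment, and drops again at 32 threads.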